Abstract

Hierarchical latent class (HLC) models are tree-structured Bayesian networks where 
leaf nodes are observed while internal nodes are latent. There are no theoretically well 
justified model selection criteria for HLC models in particular and Bayesian networks with 
latent nodes in general. Nonetheless, empirical studies suggest that the BIC score is a 
reasonable criterion to use in practice for learning HLC models. Empirical studies also 
suggest that sometimes model selection can be improved if standard model dimension is 
replaced with effective model dimension in the penalty term of the BIC score. 

Effective dimensions are difficult to compute. In this paper, we prove a theorem that
relates the effective dimension of an HLC model to the effective dimensions of a number 
of latent class models. The theorem makes it computationally feasible to compute the 
effective dimensions of large HLC models. The theorem can also be used to compute the 
effective dimensions of general tree models. 

1. Introduction 

Hierarchical latent class (HLC) models (Zhang, 2002) are tree-structured Bayesian networks 
(BNs) where leaf nodes are observed while internal nodes are latent. They generalize latent 
class models (Lazarsfeld and Henry, 1968) and were first identified as a potentially useful 
class of Bayesian networks by Pearl (1988). We are concerned with learning HLC models 
from data. A fundamental question is how to select among competing models. 

The BIC score (Schwarz, 1978) is a popular metric that researchers use to select among 
Bayesian network models. It consists of a loglikelihood term that measures the fitness 
to data and a penalty term that depends linearly upon standard model dimension, i.e. 
the number of linearly independent standard model parameters. When all variables are 
observed, the BIC score is an asymptotic approximation of (the logarithm) of the marginal 
likelihood (Schwarz, 1978). It is also consistent in the sense that, given sufficient data, the 
BIC score of the generative model — the model from which data were sampled — is larger 
than those of any other models that are not equivalent to the generative model. 

When latent variables are present, the BIC score is no longer an asymptotic approx- 
imation of the marginal likelihood (Geiger et al, 1996). This can be remedied, to some 
extent, using the concept of effective model dimension. In fact if we replace standard model 
dimension with effective model dimension in the BIC score, the resulting scoring function, 
called the BICe score, is an asymptotic approximation of the marginal likelihood almost 
everywhere except for some singular points (Rusakov and Geiger, 2002).

Neither BIC nor BICe has been proved to be consistent for latent variable models. As
a matter of fact, it has not even been defined what it means for a model selection criterion 
to be consistent for latent variable models. Empirical studies suggest that the BIC score is 
well-behaved in practice for the task of learning HLC models. There are three related search- 
based algorithms for learning HLC models, namely double hill-climbing (DHC) (Zhang, 
2002), single hill-climbing (SHC) (Zhang et al, 2003), and heuristic SHC (HSHC) (Zhang, 
2003). In the absence of a theoretically well justified model selection criterion, Zhang (2002) 
tested DHC with four existing scoring functions, namely the AIC score (Akaike, 1974), the 
BIC score, the Cheeseman-Stutz (CS) score (Cheeseman and Stutz, 1995), and the holdout 
logarithmic score (HLS) (Cowell et al, 1999). Both real-world and synthetic data were used.
On the real-world data, BIC and CS have enabled DHC to find models that are regarded as 
the best by domain experts. On the synthetic data, BIC and CS have enabled DHC to find 
models that either are identical to or resemble closely the true generative models. When 
coupled with AIC and HLS, on the other hand, DHC performed significantly worse. SHC 
and HSHC were tested on synthetic data sampled from fairly large HLC models (as many
as 28 nodes). Only BIC was used in those tests. In all cases, BIC has enabled SHC and
HSHC to find models that either are identical to or resemble closely the true generative 
models. Those empirical results not only indicate that the algorithms perform well, but 
also suggest that the BIC is a reasonable scoring function to use for learning HLC models. 

The experiments also reveal that model selection can sometimes be improved if the BICe 
score is used instead of the BIC score. We will explain this in detail in Section 3.

In order to use the BICe score in practice, we need a way to compute effective dimen- 
sions. This is not a trivial task. The effective dimension of an HLC model is the rank of 
the Jacobian matrix of the mapping from the parameters of the model to the parameters 
of the joint distribution of the observed variables. The number of rows in the Jacobian 
matrix increases exponentially with the number of observed variables. The construction of 
the Jacobian matrix and the calculation of its rank are both computationally demanding. 
Moreover they have to be done algebraically or with very high numerical precision to avoid 
degenerate cases. The necessary precision grows with the size of the matrix. 

Settimi and Smith (1998, 1999) studied effective dimensions for two classes of models: 
trees with binary variables and latent class (LC) models with two observed variables. They 
have obtained a complete characterization of these two classes. Geiger et al. (1996) com- 
puted the effective dimensions of a number of models. They conjectured that it is rare for 
the effective and standard dimensions of an LC model to differ. As a matter of fact, they 
found only one such model. Kocka and Zhang (2002) found quite a number of LC models 
whose effective and standard dimensions differ. They also proposed an easily computable 
formula for estimating effective dimensions of LC models. The estimation formula has been 
empirically shown to be very accurate. 

In this paper, we prove a theorem that relates the effective dimension of an HLC model 
to the effective dimensions of two other HLC models that contain fewer latent variables. 
Repeated application of the theorem allows one to reduce the task of computing the effective 
dimension of an HLC model to subtasks of computing effective dimensions of LC models. 
This makes it computationally feasible to compute the effective dimensions of large HLC 
models.

We start in Section 2 with a formal definition of effective dimensions for Bayesian networks
with latent variables. In Section 3, we provide empirical evidence suggesting that the use
of BICe instead of BIC sometimes improves model selection. Section 4 presents the main
theorem and Section 5 is devoted to the proof of the theorem. In Section 6, we prove a theorem
about effective dimensions of general tree models and explain how this and our main
theorem allow one to compute the effective dimension of arbitrary tree models. Finally,
concluding remarks are provided in Section 7.

2. Effective Dimensions of Bayesian Networks 

In this paper, we use capital letters such as X and Y to denote variables and lower case 
letters such as x and y to denote states of variables. The domain and cardinality of a 
variable X will be denoted by $\Omega_X$ and $|X|$ respectively. Bold face capital letters such as $\mathbf{Y}$
denote sets of variables. $\Omega_{\mathbf{Y}}$ denotes the Cartesian product of the domains of all variables
in the set $\mathbf{Y}$. Elements of $\Omega_{\mathbf{Y}}$ will be denoted by bold lower case letters such as $\mathbf{y}$ and will
sometimes be referred to as states of $\mathbf{Y}$. We will consider only variables that have a finite
number of states. 

Consider a Bayesian network model M that possibly contains latent variables. The 
standard dimension $d_s(M)$ of M is the number of linearly independent parameters in the
standard parameterization of M. The parameters denote, for each variable and each parent
configuration of the variable, the probability that the variable is in some state (except one)
given the parent configuration. Suppose M consists of k variables $X_1, X_2, \ldots, X_k$. Let $r_i$ and
$q_i$ be respectively the number of states of $X_i$ and the number of all possible combinations of
the states of its parents. If $X_i$ has no parent, let $q_i$ be 1. Then $d_s(M)$ is given by

$$d_s(M) = \sum_{i=1}^{k} q_i (r_i - 1).$$
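
The formula for the standard dimension can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary-based encoding of the network, and the small LC model used as input are assumptions made for the example and do not come from the paper.

```python
def standard_dimension(cards, parents):
    """Standard dimension: sum over variables of q_i * (r_i - 1).

    cards   maps each variable name to its number of states r_i.
    parents maps each variable name to a list of its parents
            (variables without an entry are treated as roots, q_i = 1).
    """
    total = 0
    for var, r in cards.items():
        q = 1
        for p in parents.get(var, ()):
            q *= cards[p]  # number of parent configurations
        total += q * (r - 1)
    return total


# Hypothetical LC model: a binary latent Z with three ternary observed children.
cards = {"Z": 2, "Y1": 3, "Y2": 3, "Y3": 3}
parents = {"Y1": ["Z"], "Y2": ["Z"], "Y3": ["Z"]}
print(standard_dimension(cards, parents))  # 1 + 3 * (2 * 2) = 13
```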

For notational simplicity, denote the standard dimension of M by n. Let $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$
be a vector of n linearly independent model parameters of M. Further let $\mathbf{Y}$ be the set of
observed variables. Suppose $\mathbf{Y}$ has m+1 possible states. We enumerate the first m states
as $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m$.

For any i ($1 \le i \le m$), $P(\mathbf{y}_i)$ is a function of the parameters $\theta$. So we have a mapping from
the n-dimensional parameter space (a subspace of $R^n$) to $R^m$, namely
$T : (\theta_1, \theta_2, \ldots, \theta_n) \mapsto (P(\mathbf{y}_1), P(\mathbf{y}_2), \ldots, P(\mathbf{y}_m))$. The Jacobian matrix of this mapping is the following $m \times n$
matrix:

$$J_M(\theta) = [J_{ij}] = \left[ \frac{\partial P(\mathbf{y}_i)}{\partial \theta_j} \right].$$

For convenience, we will often write the matrix as $J_M = [\partial P(\mathbf{Y}) / \partial \theta_j]$, with the understanding
that elements of the j-th column are obtained by allowing $\mathbf{Y}$ to run over all its possible states
except one.

For each i, $P(\mathbf{y}_i)$ is a function of $\theta$. For most commonly used parameterizations of
Bayesian networks, it is actually a polynomial function of $\theta$. Hence we make the following
assumption:

Assumption 1 The Bayesian network M is so parameterized that the parameters for the
joint distribution of the observed variables are polynomial functions of the parameters for
M.

An obvious consequence of the assumption is that elements of $J_M$ are also polynomial
functions of $\theta$.

For a given value of $\theta$, $J_M$ is a matrix of real numbers. Due to Assumption 1, the rank
of this matrix is some constant d almost everywhere in the parameter space (Geiger et al,
1996; see also Section 5.1). To be more specific, the rank is d everywhere except in a set
of measure zero where it is smaller than d. The constant is called the regular rank of $J_M$.

The regular rank of $J_M$ is also called the effective dimension of the Bayesian network
model M. Hence we denote it by $d_e(M)$. To understand the term "effective dimension",
consider the subspace of $R^m$ spanned by the joint probability $P(\mathbf{Y})$ of observed variables,
or equivalently the range of the mapping T. The term reflects the fact that, for almost every
value of $\theta$, a small enough open ball around $T(\theta)$ resembles Euclidean space of dimension d
(Geiger et al, 1996).

There are multiple ways to parameterize a given Bayesian network model. However, the
choice of parameterization does not affect the space spanned by the joint probability $P(\mathbf{Y})$.
Together with the interpretation of the previous paragraph, this implies that the definition
of effective dimension does not depend on the particular parameterization that one uses.

3. Selecting among HLC Models 

A hierarchical latent class (HLC) model is a Bayesian network where (1) the network struc- 
ture is a rooted tree and (2) the variables at the leaf nodes are observed and all the other 
variables are not. The observed variables are sometimes referred to as manifest variables 
and all the other variables as latent variables. Figure 1 shows the structures of two HLC 
models. A latent class (LC) model is an HLC model where there is only one latent variable. 

The theme of this paper is the computation of effective dimensions of HLC models. As 
mentioned in the introduction, this is interesting because effective dimension, when used in 
the BIC score, gives us a better approximation of the marginal likelihood. In this section,
we give an example to illustrate that the use of effective dimension sometimes also leads to 
better model selection. We will also motivate and introduce the concept of regularity that 
will be used in subsequent sections. 

3.1 An Example of Model Selection 

Consider the two HLC models shown in Figure 1. In one experiment, we instantiated the
parameters of $M_1$ in a random fashion and sampled a set $D_1$ of 10,000 data records on the
observed variables. Then we ran SHC and HSHC on the data set $D_1$ under the guidance
of the BIC score. Both algorithms produced model $M_2$. In the following, we explain why,
based on $D_1$, one would prefer $M_2$ over $M_1$ if BIC is used for model selection and why $M_1$
would be preferred if BICe is used instead. We argue that $M_1$ should be preferred based on
$D_1$ and hence BICe is a better scoring metric for this case.



Figure 1: Two HLC models, $M_1$ and $M_2$. The shaded variables are latent, while the other variables are
observed. The cardinality of $X_1$ is 2, while the cardinalities of all other variables are
3.



The BIC and BICe scores of a model M given a data set D are defined as follows:

$$BIC(M|D) = \log P(D|M, \theta^*) - \frac{d_s(M)}{2} \log N,$$

$$BICe(M|D) = \log P(D|M, \theta^*) - \frac{d_e(M)}{2} \log N,$$

where $\theta^*$ is the maximum likelihood estimate of the parameters of M based on D and N is
the sample size.

In our example, notice that $M_2$ includes $M_1$ in the sense that $M_2$ can represent any
probability distribution of the observed variables that $M_1$ can. In fact, if we make the
conditional probability distributions of the observed variables in $M_2$ the same as in $M_1$ and
set $P_{M_2}(X_2)$ and $P_{M_2}(X_3|X_2)$ such that

$$P_{M_2}(X_2) P_{M_2}(X_3|X_2) = \sum_{X_1} P_{M_1}(X_1) P_{M_1}(X_2|X_1) P_{M_1}(X_3|X_1),$$

then the probability distributions of the observed variables in the two models are identical.
Because $M_2$ includes $M_1$, we have $\log P(D_1|M_1, \theta_1^*) \le \log P(D_1|M_2, \theta_2^*)$. Together with
the fact that $D_1$ is sampled from $M_1$, this implies that $\log P(D_1|M_1, \theta_1^*) \approx \log P(D_1|M_2, \theta_2^*)$
for a sufficiently large sample size. The standard dimension of $M_1$ is 45, while that
of $M_2$ is 44. Hence

$$BIC(M_1|D_1) < BIC(M_2|D_1).$$

On the other hand, the effective dimensions of $M_1$ and $M_2$ are 43 and 44 respectively. Hence

$$BICe(M_1|D_1) > BICe(M_2|D_1).$$
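
The two inequalities can be checked numerically with a short Python sketch. The dimensions (45 and 44 standard, 43 and 44 effective) and the sample size of 10,000 are the ones quoted in this section; the common log-likelihood value is a hypothetical placeholder, assumed equal for both models as the argument supposes, and the function name is illustrative.

```python
import math


def bic(loglik, dimension, n_samples):
    """BIC-style score: log-likelihood minus (dimension / 2) * log N."""
    return loglik - dimension / 2.0 * math.log(n_samples)


N = 10_000
loglik = -65_000.0  # hypothetical maximized log-likelihood, assumed equal for M1 and M2

bic_m1, bic_m2 = bic(loglik, 45, N), bic(loglik, 44, N)    # penalty uses standard dimensions
bice_m1, bice_m2 = bic(loglik, 43, N), bic(loglik, 44, N)  # penalty uses effective dimensions

print("BIC prefers", "M2" if bic_m2 > bic_m1 else "M1")
print("BICe prefers", "M1" if bice_m1 > bice_m2 else "M2")
```

With equal likelihood terms the comparison is decided entirely by the penalty, so the choice of the placeholder log-likelihood does not affect which model each score prefers.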

Model $M_2$ includes $M_1$. The opposite is clearly not true because the effective dimension
of $M_1$ is smaller than that of $M_2$. So, $M_2$ is in reality a more complex model than $M_1$. Both
models fit data $D_1$ equally well. Hence the simpler one, i.e. $M_1$, should be preferred over
the other. This agrees with the choice of the BICe score, but disagrees with the choice of
the BIC score. Hence, BICe is more appropriate than BIC in this case.

3.2 Regularity

Now consider another model $M_1'$ that is the same as $M_1$ except that the cardinality of $X_1$
is increased from 2 to 3. It is easy to show that $M_2$ includes $M_1'$ and vice versa. So, the two
models are equivalent in terms of their capabilities of representing probability distributions
of the observed variables. They are hence said to be marginally equivalent. However, $M_1'$
has more standard parameters than $M_2$ and hence we would always prefer $M_2$ over $M_1'$. To
formalize this consideration, we introduce a concept of regularity.

For a latent variable Z in an HLC model, enumerate its neighbors (parent and children)
as $X_1, X_2, \ldots, X_k$. An HLC model is regular if for any latent variable Z,

$$|Z| \le \frac{\prod_{i=1}^{k} |X_i|}{\max_{i=1}^{k} |X_i|}, \qquad (1)$$

and the strict inequality holds when Z has two neighbors and at least one of them is a
latent node. Models $M_1$ and $M_2$ are regular, while model $M_1'$ is not.
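
A small Python sketch of the regularity test for a single latent node, reading the condition as |Z| being at most the product of the neighbor cardinalities divided by the largest of them. The function names are illustrative. The example inputs assume that $X_1$ in $M_1$ has exactly two neighbors, the latent nodes $X_2$ and $X_3$ of cardinality 3, which is how the factorization used in Section 3.1 suggests the model is structured.

```python
from math import prod


def max_regular_cardinality(neighbor_cards):
    """Right-hand side of the regularity condition: prod |X_i| / max |X_i|."""
    # The division is exact because the maximum is one of the factors.
    return prod(neighbor_cards) // max(neighbor_cards)


def satisfies_regularity(z_card, neighbor_cards, has_latent_neighbor):
    """Check the condition for one latent node Z with the given neighbors."""
    bound = max_regular_cardinality(neighbor_cards)
    # Strict inequality is required for a node with two neighbors,
    # at least one of which is latent.
    strict = len(neighbor_cards) == 2 and has_latent_neighbor
    return z_card < bound if strict else z_card <= bound


# X1 with cardinality 2 (as in M1) versus cardinality 3 (the enlarged variant).
print(satisfies_regularity(2, [3, 3], has_latent_neighbor=True))  # True
print(satisfies_regularity(3, [3, 3], has_latent_neighbor=True))  # False
```

A full check of an HLC model would apply this test to every latent node.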

For any irregular model M there always exists a regular model that is marginally equivalent
to M and has fewer standard parameters (Zhang, 2003b). The regular model can
be obtained from M as follows: For any latent node that has only two neighbors and whose
cardinality is no smaller than that of one of the neighbors, remove the latent node and
connect the two neighbors. For any latent node that has more than two neighbors and that
violates (1), reduce its cardinality to the quantity on the right hand side. Repeat both
steps until no more changes can be made.

It is also interesting to note that the collection of all regular HLC models for a given set 
of observed variables is finite (Zhang, 2002). This provides a finite search space for the task 
of learning regular HLC models.^1 In the rest of this paper, we will consider only regular
HLC models. 

Before ending this subsection, we point out a nice property of effective model dimension 
in relation to model inclusion. If an HLC model includes another model, then its effective 
dimension is no less than that of the latter. As a consequence, two marginally equivalent 
models have the same effective dimensions and hence the same BICe score. The same is 
not true for standard model dimension and the BIC score. 

3.3 The CS and CSe Scores 

We have argued on empirical grounds that the BIC score is a reasonable scoring function 
to use for learning HLC models and that the BICe score can sometimes improve model 
selection. But the two scores are not free of problems. One problem is that their derivation
as Laplace approximations of the marginal likelihood is not valid at the boundary of the
parameter space. The CS score in a way alleviates this problem. It involves the BIC score
based on completed data and the BIC score based on original data. In other words, it
involves two Laplace approximations of the marginal likelihood. It lets errors in the two
approximations cancel each other.

Chickering and Heckerman (1997) empirically found the CS score to be a quite accurate
approximation of the marginal likelihood and robust at the boundary of the parameter
space. They realized the need for effective model dimension in the CS score, although they
did not actually use it. This would not have made any difference to their experiments
because, for the models they used, the standard and effective dimensions agree.

1. The definition of regularity given in this paper is slightly different from the one given in Zhang (2002).
Nonetheless, the two conclusions mentioned in this paragraph remain true.

Figure 2: Problem reduction.

We use CSe to refer to the scoring function one obtains by replacing standard model 
dimension in the CS score with effective model dimension. Just as BICe is better than
BIC as an approximation of the marginal likelihood (Geiger et al, 1996), CSe is better than
CS. To compute CSe, we also need to calculate effective dimensions. 

4. Effective Dimensions of HLC Models 

As we have seen, effective model dimension is interesting for a number of reasons. Our
main result in this paper is a theorem about the effective dimension $d_e(M)$ of a regular
HLC model M that contains more than one latent variable. Let X be the root of M, which
is a latent node. Because there are at least two latent nodes, there must exist another latent
node Z that is a child of X. In the following, we will use the terms X-branch and Z-branch
to respectively refer to the sets of nodes that are separated from Z by X or from X by Z.
Let $\mathbf{Y}$ be the set of observed variables in the Z-branch and let $\mathbf{O}$ be the set of all other
observed variables. Note that the X-branch doesn't contain the node X. The relationship
among X, Z, $\mathbf{Y}$, and $\mathbf{O}$ is depicted in the left-most picture of Figure 2.

The standard parameterization of M includes parameters for P(X) and parameters for
P(Z|X). For convenience, we replace those parameters with parameters for P(X, Z). As
mentioned at the end of Section 2, such reparameterization does not affect the effective
dimension $d_e(M)$. To reflect the reparameterization, the edge between X and Z is not
directed in Figure 2.

Suppose P(X, Z) has $k_0$ parameters $\theta^{(0)}_1, \theta^{(0)}_2, \ldots, \theta^{(0)}_{k_0}$. Suppose the conditional
distributions of variables in the X-branch consist of $k_1$ parameters $\theta^{(1)}_1, \theta^{(1)}_2, \ldots, \theta^{(1)}_{k_1}$, and the
conditional distributions of variables in the Z-branch consist of $k_2$ parameters $\theta^{(2)}_1, \theta^{(2)}_2,
\ldots, \theta^{(2)}_{k_2}$. For convenience we will sometimes refer to those three groups of parameters using
three vectors $\theta^{(0)}$, $\theta^{(1)}$ and $\theta^{(2)}$ respectively.

In the following, we will define two other HLC models $M_1$ and $M_2$ starting from M and
establish a relationship between their effective dimensions and the effective dimension of
M. In this context, M, $M_1$, and $M_2$ are regarded purely as mathematical objects. The
semantics of their variables are of no concern. In particular, a variable H that is latent
in M might be designated to be observed in $M_1$ or $M_2$ as part of the definition of those
mathematical objects.

Figure 3: The picture on the left shows an HLC model with five observed and five latent
variables; each variable is annotated by its name and its cardinality. The picture
on the right shows the components we can decompose the HLC model into by
applying Theorem 1. Latent variables are shaded, while observed variables are
not.

We obtain a Bayesian network model $B_1$ from M by deleting the Z-branch. Strictly
speaking $B_1$ is not a Bayesian network due to the parameterization it inherits from M: instead
of probability tables P(X) and P(Z|X), we have table P(X, Z). But P(X) and P(Z|X) can
readily be obtained from P(X, Z). With this in mind, we view $B_1$ as a Bayesian network.
This network is obviously tree-structured. Its leaf variables include those in the set $\mathbf{O}$ and
the variable Z. We define $M_1$ to be the HLC model that shares the same structure as $B_1$
and where the variable Z and all the variables in $\mathbf{O}$ are observed. The parameters of $M_1$
are $\theta^{(0)}$ and $\theta^{(1)}$.

Similarly let $B_2$ be the Bayesian network model obtained from M by deleting the X-branch.
It is tree-structured and its leaf variables include those in $\mathbf{Y}$ and the variable X.
We define $M_2$ to be the HLC model that shares the same structure as $B_2$ and where the
variable X and all the variables in $\mathbf{Y}$ are observed. The parameters of $M_2$ are $\theta^{(0)}$ and $\theta^{(2)}$.

Theorem 1 Suppose M is a regular HLC model that contains two or more latent nodes.
Then the two HLC models $M_1$ and $M_2$ defined in the text are also regular. Moreover,

$$d_e(M) = d_e(M_1) + d_e(M_2) - [d_s(M_1) + d_s(M_2) - d_s(M)]. \qquad (2)$$

In words, the effective dimension of M equals the sum of the effective dimensions of $M_1$
and $M_2$ minus the number of common parameters that $M_1$ and $M_2$ share.

To appreciate the significance of this theorem, consider the task of computing the effective
dimension of a regular HLC model that contains two or more latent nodes. By
repeatedly applying the theorem, we can reduce the task into subtasks of calculating effective
dimensions of LC models. As an example, consider the HLC model depicted by the
picture on the left in Figure 3. Theorem 1 allows us to, for the purpose of computing its
effective dimension, decompose the HLC model into five LC models, which are shown on
the right in Figure 3.

How might one compute the effective dimension of an LC model? One way is to use 
the algorithm suggested by Geiger et al. (1996). The algorithm first symbolically computes 
the Jacobian matrix, which is possible due to Assumption 1. Then it randomly assigns 
values to the parameters, resulting in a numerical matrix. The rank of the numerical matrix
is computed by diagonalization. Because the rank of the Jacobian matrix equals the effective
dimension of the LC model almost everywhere, we get the regular rank with probability 
one. This algorithm has recently been implemented by Rusakov and Geiger (2003). Kocka 
and Zhang (2002) suggest an alternative algorithm that computes an upper bound. The 
algorithm is fast and has been empirically shown to produce extremely tight bounds. 
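
The random-evaluation idea can be sketched for LC models in plain Python with exact rational arithmetic. This is not the implementation of Geiger et al. or of Rusakov and Geiger: instead of symbolic differentiation it uses the closed-form partial derivatives of P(y) = sum over z of P(z) times the product of P(y_i|z), and all function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
import random


def random_dist(n, rng):
    """Random strictly positive distribution over n states, as exact Fractions."""
    w = [rng.randint(1, 10**6) for _ in range(n)]
    s = sum(w)
    return [Fraction(x, s) for x in w]


def lc_jacobian(latent_card, obs_cards, rng):
    """Jacobian of theta -> P(Y) for an LC model at a random parameter point."""
    pz = random_dist(latent_card, rng)  # P(Z)
    # cpt[i][z][y] = P(Y_i = y | Z = z)
    cpt = [[random_dist(r, rng) for _ in range(latent_card)] for r in obs_cards]
    # One row per joint state of the observed variables, leaving one state out.
    states = list(product(*[range(r) for r in obs_cards]))[:-1]

    def prod_except(y, z, skip=None):
        p = Fraction(1)
        for i, yi in enumerate(y):
            if i != skip:
                p *= cpt[i][z][yi]
        return p

    last = latent_card - 1
    rows = []
    for y in states:
        row = []
        # Free parameters P(Z=z) for z < last; P(Z=last) is one minus their sum.
        for z in range(last):
            row.append(prod_except(y, z) - prod_except(y, last))
        # Free parameters P(Y_i=a | Z=z) for a < r_i - 1.
        for i, r in enumerate(obs_cards):
            for z in range(latent_card):
                rest = pz[z] * prod_except(y, z, skip=i)
                for a in range(r - 1):
                    row.append(rest * ((y[i] == a) - (y[i] == r - 1)))
        rows.append(row)
    return rows


def rank(matrix):
    """Exact rank by Gaussian elimination over Fractions."""
    m = [row[:] for row in matrix]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            if m[r][col] != 0:
                f = m[r][col] / m[rk][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def effective_dimension(latent_card, obs_cards, seed=0):
    """Rank of the Jacobian at one random point; equals the regular rank almost surely."""
    return rank(lc_jacobian(latent_card, obs_cards, random.Random(seed)))


# Binary latent variable with two binary observed children:
# the standard dimension is 1 + 2*2 = 5, but the joint has only 3 free parameters.
print(effective_dimension(2, [2, 2]))
```

Exact fractions avoid the numerical-precision issue mentioned in the introduction, at the cost of speed. The number of rows still grows exponentially with the number of observed variables, so the sketch is only practical for small models.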

Going back to our example, the effective dimensions of the LC models for $X_1$, $X_2$, $X_3$, $X_4$
and $X_5$ are 26, 23, 23, 34 and 17 respectively. Thus the effective dimension of the HLC model
in Figure 3 is 26+23+34+23+17-(5*3-1)-(3*6-1)-(6*3-1)-(3*5-1) = 61. In contrast,
the standard dimension of the model is 5+6*2+6*2+6*2+3*4+5*5+5+3*4+5*2+5 = 110.
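
The arithmetic of this example can be reproduced by folding Theorem 1 over the five LC components, as in the Python sketch below. The function names are illustrative. The order in which components are paired with the subtracted terms is an assumption made for the example; it does not affect the total, since every dimension is added once and every shared-parameter count is subtracted once.

```python
def combine(de_first, de_second, shared):
    """Theorem 1 in the form d_e(M) = d_e(M1) + d_e(M2) - k0."""
    return de_first + de_second - shared


def joint_params(card_a, card_b):
    """Number of free parameters of a joint table P(A, B)."""
    return card_a * card_b - 1


lc_dims = [26, 23, 34, 23, 17]            # effective dimensions of the LC components
edges = [(5, 3), (3, 6), (6, 3), (3, 5)]  # cardinality pairs behind the subtracted terms

total = lc_dims[0]
for de, (a, b) in zip(lc_dims[1:], edges):
    total = combine(total, de, joint_params(a, b))

# Terms of the standard-dimension sum quoted in the text.
standard = 5 + 6*2 + 6*2 + 6*2 + 3*4 + 5*5 + 5 + 3*4 + 5*2 + 5

print(total, standard)  # 61 110
```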

5. Proof of Main Result 

This section is devoted to the proof of Theorem 1. We begin with some properties of 
Jacobian matrices of Bayesian network models. 

5.1 Properties of Jacobian Matrices 

Consider the Jacobian matrix $J_M$ of a Bayesian network model M. It is a matrix parameterized
by the parameters $\theta$ of M. Let $v_1, v_2, \ldots, v_m$ be column vectors of $J_M$.

Lemma 1 A number of column vectors $v_1, v_2, \ldots, v_m$ of the Jacobian matrix $J_M$ are
either linearly dependent everywhere or linearly independent almost everywhere. They are
linearly dependent everywhere if and only if there exists at least one column vector $v_j$ that
can be expressed as a linear combination of other column vectors everywhere.

Proof: Consider diagonalizing the following transposed matrix:

$$[v_1, v_2, \ldots, v_m]^T.$$

According to Assumption 1, elements of the matrix are polynomials (of $\theta$). Hence we would
multiply rows with polynomials or fractions of polynomials. Of course, we also need to add
one row to another row. At the end of the process, we get a diagonal matrix whose nonzero
elements are polynomials or fractions of polynomials. Suppose there are k nonzero rows
and suppose they correspond to $v_1, v_2, \ldots, v_k$.

Because elements of the diagonalized matrix are polynomials or fractions of polynomials,
they are well-defined^2 and nonzero almost everywhere (i.e. for almost all values of $\theta$). If
k=m, then the m vectors are linearly independent of each other almost everywhere.

If k<m, there exist, for each j ($k < j \le m$), polynomials or fractions of polynomials $c_i$
($1 \le i \le k$) such that

$$v_j = \sum_{i=1}^{k} c_i v_i. \qquad (3)$$

The coefficients $c_i$ can be determined by tracing the diagonalization process. So $v_j$ can be
expressed as a linear combination of $\{v_i \mid i = 1, \ldots, k\}$ everywhere.^3 □

2. A fraction is not well defined if the denominator is zero.

Although it might sound trivial, this lemma is actually quite interesting. This is because 
$J_M$ is a parameterized matrix. The first part, for example, implies that there do not exist
two subspaces of the parameter space that both have nonzero measures such that the m 
vectors are linearly independent in one subspace while linearly dependent in the other. 

If m is the total number of column vectors of $J_M$, we get the following lemma:

Lemma 2 In the Jacobian matrix $J_M$, there exists a collection of column vectors that form
a basis of its column space almost everywhere. The number of vectors in the collection
equals the regular rank of the matrix. Moreover, the collection can be chosen to include
any given set of column vectors that are linearly independent almost everywhere.

Proof: The first part has already been proved. The second part follows from the definition 
of regular rank. The last part is true because we could start the diagonalization process 
with the transpose of the vectors in the set on the top of the matrix. □ 
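
The last part of the lemma corresponds to a simple greedy procedure when the matrix is evaluated at one fixed parameter value: process the preferred columns first and keep every column that is not in the span of those already kept. The Python sketch below illustrates this on a numeric matrix using exact fractions. The function names and the tiny example matrix are assumptions made for illustration and are not part of the paper's proof.

```python
from fractions import Fraction


def reduce_against(echelon, vec):
    """Subtract multiples of the stored rows so vec vanishes at their pivot positions."""
    v = vec[:]
    for lead, row in echelon:
        if v[lead] != 0:
            f = v[lead] / row[lead]
            v = [a - f * b for a, b in zip(v, row)]
    return v


def extend_to_basis(columns, preferred):
    """Return indices of columns forming a basis of the column space.

    Columns listed in `preferred` are tried first, so they are all kept
    whenever they are linearly independent.
    """
    order = list(preferred) + [j for j in range(len(columns)) if j not in preferred]
    echelon, chosen = [], []
    for j in order:
        rem = reduce_against(echelon, [Fraction(x) for x in columns[j]])
        lead = next((i for i, x in enumerate(rem) if x != 0), None)
        if lead is not None:  # column j is independent of those chosen so far
            echelon.append((lead, rem))
            chosen.append(j)
    return chosen


# Column 0 and column 1 are parallel; asking for column 1 first keeps it in the basis.
cols = [[1, 0], [2, 0], [0, 1]]
print(extend_to_basis(cols, preferred=[1]))  # [1, 2]
```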

5.2 Proof of Theorem 1 

We now set out to prove Theorem 1. It is straightforward to verify that the HLC models
$M_1$ and $M_2$ are regular. So it suffices to prove equation (2). This is what we do in the rest
of this section.

The set of observed variables in M is $\mathbf{O} \cup \mathbf{Y}$, the set of observed variables in $M_1$ is
$\mathbf{O} \cup \{Z\}$ and the set of observed variables in $M_2$ is $\mathbf{Y} \cup \{X\}$. Hence the Jacobian matrices
of models M, $M_1$, and $M_2$ can be respectively written as follows:



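$$J_M = \left[ \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(0)}_{k_0}}, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(1)}_1}, \ldots, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(1)}_{k_1}}, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(2)}_1}, \ldots, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(2)}_{k_2}} \right],$$

$$J_{M_1} = \left[ \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(0)}_{k_0}}, \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(1)}_1}, \ldots, \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(1)}_{k_1}} \right],$$

$$J_{M_2} = \left[ \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(0)}_{k_0}}, \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(2)}_1}, \ldots, \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(2)}_{k_2}} \right].$$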



3. There is a subtle point here. Being fractions of polynomials of $\theta$, the $c_i$ might be undefined for some
values of $\theta$. So from equation (3) alone, we cannot conclude that $v_j$ linearly depends on $\{v_i \mid i = 1, \ldots, k\}$
everywhere.

The conclusion is nonetheless true for two reasons. First, the set of $\theta$ values where the $c_i$ are
undefined has measure zero. Second, if $v_j$ does not linearly depend on $\{v_i \mid i = 1, \ldots, k\}$ at one value of
$\theta$, then the same would be true in a sufficiently small and nonetheless measure-positive ball around that
value.

It is clear that there is a one-to-one correspondence between the first $k_0+k_1$ column vectors
of $J_M$ and the column vectors of $J_{M_1}$, and there is a one-to-one correspondence between
the first $k_0$ and the last $k_2$ column vectors of $J_M$ and the column vectors of $J_{M_2}$. We will
first show

Claim 1: The first $k_0$ vectors of $J_M$ ($J_{M_1}$ or $J_{M_2}$) are linearly independent
almost everywhere.

Together with Lemma 2, Claim 1 implies that there is a collection of column vectors in
$J_{M_1}$ that includes the first $k_0$ vectors and that is a basis of the column space of $J_{M_1}$ almost
everywhere. In particular, this implies that $d_e(M_1) \ge k_0$. Suppose $d_e(M_1) = k_0 + r$. Without
loss of generality, suppose the basis vectors are

$$\frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(0)}_{k_0}}, \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(1)}_1}, \ldots, \frac{\partial P(\mathbf{O},Z)}{\partial \theta^{(1)}_r}. \qquad (4)$$

By symmetry, we can assume that $d_e(M_2) = k_0 + s$ where $s \ge 0$ and that the following column
vectors form a basis for $J_{M_2}$ almost everywhere:

$$\frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(0)}_{k_0}}, \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(2)}_1}, \ldots, \frac{\partial P(X,\mathbf{Y})}{\partial \theta^{(2)}_s}. \qquad (5)$$

Now consider the following list of vectors in $J_M$:

$$\frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(0)}_{k_0}}, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(1)}_1}, \ldots, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(1)}_r}, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(2)}_1}, \ldots, \frac{\partial P(\mathbf{O},\mathbf{Y})}{\partial \theta^{(2)}_s}. \qquad (6)$$

We will show

Claim 2: All column vectors of $J_M$ linearly depend on the vectors listed in (6)
everywhere.

Claim 3: The vectors listed in (6) are linearly independent almost everywhere.

Those two claims imply that the vectors listed in (6) form a basis of the column space of
$J_M$ almost everywhere. Therefore

$$d_e(M) = k_0 + r + s = d_e(M_1) + d_e(M_2) - k_0.$$

Since $d_s(M) = k_0 + k_1 + k_2$, $d_s(M_1) = k_0 + k_1$ and $d_s(M_2) = k_0 + k_2$, it is clear that
$k_0 = d_s(M_1) + d_s(M_2) - d_s(M)$. Therefore Theorem 1 is proved. □

5.3 Proof of Claim 1 

Lemma 3 Let Z be a latent node in an HLC model M and $\mathbf{Y}$ be the set of the observed
nodes in the subtree rooted at Z. If M is regular, then we can set conditional distributions
of nodes in the subtree in such a way that they encode an injective mapping $\rho$ from $\Omega_Z$ to
$\Omega_{\mathbf{Y}}$ in the sense that $P(\mathbf{Y}=\rho(z) \mid Z=z) = 1$ for all $z \in \Omega_Z$.

Proof: We prove this lemma by induction on the number of latent nodes in the subtree
rooted at Z. First consider the case when there is only one latent node, namely Z. In this
case, Z is the parent of all nodes in $\mathbf{Y}$. Enumerate all these nodes as $Y_1, Y_2, \ldots, Y_k$. Because
M is regular, we have $|Z| \le \prod_{i=1}^{k} |Y_i|$. Hence we can define an injective mapping $\rho$ from
$\Omega_Z$ to $\Omega_{\mathbf{Y}} = \prod_{i=1}^{k} \Omega_{Y_i}$. For each state z of Z, $\rho(z)$ can be written as $\mathbf{y} = (y_1, y_2, \ldots, y_k)$,
where $y_i$ is a state of $Y_i$. Now if we set

$$P(Y_i = y_i \mid Z = z) = 1,$$

then $P(\mathbf{Y}=\rho(z) \mid Z=z) = 1$.
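
The base case of this construction can be sketched in Python for an LC model: pick the first |Z| joint states of the observed variables as the image of the mapping and build deterministic conditional tables that put all mass on them. The function names and the choice of lexicographic order are illustrative assumptions; any injective assignment would serve.

```python
from itertools import islice, product


def injective_cpts(z_card, y_cards):
    """Deterministic CPTs encoding an injective map rho: states of Z -> joint states of Y."""
    total = 1
    for r in y_cards:
        total *= r
    if z_card > total:
        raise ValueError("no injective mapping exists: |Z| exceeds the product of |Y_i|")
    # rho[z] is the z-th joint state of (Y_1, ..., Y_k) in lexicographic order.
    rho = list(islice(product(*[range(r) for r in y_cards]), z_card))
    # cpts[i][z][y] = P(Y_i = y | Z = z): all mass on the i-th component of rho[z].
    cpts = [[[1.0 if y == rho[z][i] else 0.0 for y in range(r)]
             for z in range(z_card)]
            for i, r in enumerate(y_cards)]
    return rho, cpts


def prob_of_joint(cpts, z, y):
    """P(Y = y | Z = z) under conditional independence of the Y_i given Z."""
    p = 1.0
    for i, yi in enumerate(y):
        p *= cpts[i][z][yi]
    return p


rho, cpts = injective_cpts(3, [2, 2])
print(rho)                                                   # three distinct joint states
print([prob_of_joint(cpts, z, rho[z]) for z in range(3)])    # each equals 1.0
```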

Now consider the case when there are at least two hidden nodes in the subtree rooted
at Z. Let W be one such latent node that has no latent node descendants. Let $\mathbf{Y}^{(1)}$ be
the set of observed nodes in the subtree rooted at W and $\mathbf{Y}^{(2)} = \mathbf{Y} \setminus \mathbf{Y}^{(1)}$. By the induction
hypothesis, we can parameterize the subtree rooted at W in such a way that it encodes an
injective mapping from $\Omega_W$ to $\Omega_{\mathbf{Y}^{(1)}}$. Moreover, if all nodes below W are removed from M,
M remains a regular HLC model. In that model, we can parameterize the subtree rooted at
Z in such a way that it encodes an injective mapping from $\Omega_Z$ to $\Omega_{\{W\} \cup \mathbf{Y}^{(2)}} = \Omega_W \times \Omega_{\mathbf{Y}^{(2)}}$.
Together, those two facts prove the lemma. □



Corollary 1 Let Z be a latent node in an HLC model M. Suppose Z has a latent neighbor
X. Let $\mathbf{Y}$ be the set of the observed nodes separated from X by Z. If M is regular, then
we can set probability distributions of nodes separated from X by Z in such a way that they
encode an injective mapping $\rho$ from $\Omega_Z$ to $\Omega_{\mathbf{Y}}$ in the sense that $P(\mathbf{Y}=\rho(z) \mid Z=z) = 1$ for
all $z \in \Omega_Z$.

Proof: The corollary follows readily from Lemma 3 and the property of the root-walking
operation (Zhang, 2002). □



(7) 



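Proof of Claim 1: Consider the following matrix 

$$\left[ \frac{\partial P(X,Z)}{\partial \theta^{(0)}_1}, \ldots, \frac{\partial P(X,Z)}{\partial \theta^{(0)}_{k_0}} \right].$$

Because $\theta^{(0)}_1, \theta^{(0)}_2, \ldots, \theta^{(0)}_{k_0}$ are the parameters for the joint distribution $P(X, Z)$, this matrix 
is the identity matrix if the rows are properly arranged. So its column vectors are linearly 
independent almost everywhere. 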

Now consider the first $k_0$ column vectors of $J_M$: $\partial P(O,Y)/\partial\theta^{(0)}_1, \ldots, \partial P(O,Y)/\partial\theta^{(0)}_{k_0}$. 
They must be linearly independent almost everywhere. If not, one of the vectors, say 
$\partial P(O,Y)/\partial\theta^{(0)}_{k_0}$, would linearly depend on the rest everywhere according to Lemma 1. 
Observe that for any $i$ ($1 \le i \le k_0$), 

$$\frac{\partial P(O,Y)}{\partial \theta^{(0)}_i} = \sum_{X,Z} P(O \mid X)\, P(Y \mid Z)\, \frac{\partial P(X,Z)}{\partial \theta^{(0)}_i}.$$

Choose $P(O \mid X)$ and $P(Y \mid Z)$ as in Corollary 1. The vector $\partial P(O,Y)/\partial\theta^{(0)}_i$ might 
contain zero elements. If we remove the zero elements, what remains of the vector is 
identical to $\partial P(X,Z)/\partial\theta^{(0)}_i$. So we can conclude that $\partial P(X,Z)/\partial\theta^{(0)}_{k_0}$ linearly depends on 
$\partial P(X,Z)/\partial\theta^{(0)}_1, \ldots, \partial P(X,Z)/\partial\theta^{(0)}_{k_0-1}$ everywhere, which contradicts the conclusion of the 
previous paragraph. Hence the first $k_0$ vectors of $J_M$ must be linearly independent almost 
everywhere. 

It is evident that, using similar arguments, we can also show that the first $k_0$ vectors of 
$J_{M_1}$ ($J_{M_2}$) are linearly independent almost everywhere. Claim 1 is therefore proved. □ 
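
The "remove the zero elements" step can be made concrete with a small Python sketch. As a simplifying assumption that is not in the original text, every entry of P(X, Z) is treated as a free coordinate (normalization is ignored), so the derivative of P(o, y) with respect to P(x, z) is simply P(o|x)P(y|z). The function name `jacobian_wrt_joint` and the example tables are hypothetical.

```python
def jacobian_wrt_joint(p_o_given_x, p_y_given_z):
    """Rows are indexed by (o, y), columns by (x, z); entry = P(o|x) * P(y|z)."""
    n_x, n_o = len(p_o_given_x), len(p_o_given_x[0])
    n_z, n_y = len(p_y_given_z), len(p_y_given_z[0])
    return [
        [p_o_given_x[x][o] * p_y_given_z[z][y] for x in range(n_x) for z in range(n_z)]
        for o in range(n_o)
        for y in range(n_y)
    ]


# Deterministic, injective conditional tables in the spirit of Corollary 1.
p_o_given_x = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # x=0 -> o=0, x=1 -> o=2
p_y_given_z = [[1.0, 0.0], [0.0, 1.0]]            # z=0 -> y=0, z=1 -> y=1

jac = jacobian_wrt_joint(p_o_given_x, p_y_given_z)
# Dropping the all-zero rows leaves one row per (x, z) pair.
nonzero_rows = [row for row in jac if any(row)]
```

In this example the surviving rows form a 4 x 4 identity matrix, which mirrors the argument that the columns inherit the linear independence of the derivatives of P(X, Z).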

5.4 Proof of Claim 2 

Every column vector of $J_{M_1}$ linearly depends on the vectors listed in (4) everywhere. Observe 
that 

$$\frac{\partial P(O,Y)}{\partial \theta^{(1)}_i} = \sum_{Z} P(Y \mid Z)\, \frac{\partial P(O,Z)}{\partial \theta^{(1)}_i}, \qquad i = 1, \ldots, k_1.$$

Therefore every column vector of $J_M$ that corresponds to vectors in $J_{M_1}$ linearly depends 
on the first $k_0+r$ vectors listed in (6) everywhere. 

By symmetry, every column vector of $J_M$ that corresponds to vectors in $J_{M_2}$ linearly 
depends on the first $k_0$ and the last $s$ vectors listed in (6) everywhere. The claim is proved. □ 
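
The step from a dependence among columns of $J_{M_1}$ to a dependence among columns of $J_M$ can be spelled out as follows. This is a sketch under two assumptions that are not restated in this section: that O is separated from Y by Z, so that $P(O,Y) = \sum_Z P(O,Z) P(Y \mid Z)$ with $P(Y \mid Z)$ depending only on $\theta^{(2)}$, and that the vectors in (4) are derivatives of $P(O,Z)$ whose counterparts for $P(O,Y)$ are the first $k_0+r$ vectors in (6). The coefficients $a_l$ and the symbol $v_l$ are hypothetical notation introduced only here.

```latex
% theta denotes any parameter in theta^{(0)} or theta^{(1)};
% v_l ranges over the vectors listed in (4), a_l are hypothetical coefficients.
\begin{align*}
\frac{\partial P(O,Z)}{\partial \theta}
  &= \sum_{l} a_l\, v_l(O,Z), \\
\frac{\partial P(O,Y)}{\partial \theta}
  &= \sum_{Z} P(Y \mid Z)\, \frac{\partial P(O,Z)}{\partial \theta}
   = \sum_{l} a_l \sum_{Z} P(Y \mid Z)\, v_l(O,Z).
\end{align*}
```

Each inner sum $\sum_Z P(Y \mid Z)\, v_l(O,Z)$ is, under the stated assumptions, the vector in (6) that corresponds to $v_l$, so the same coefficients express the column of $J_M$ in terms of the first $k_0+r$ vectors of (6).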

5.5 Proof of Claim 3 

We prove this claim by contradiction. Assume the vectors listed in (6) were not linearly 
independent almost everywhere. According to Lemma 1, one of them, say $v$, must linearly 
depend on the rest everywhere. Because of Claim 1 and Lemma 2, we can assume that $v$ is 
among the last $r+s$ vectors. Without loss of generality, we assume that $v$ is $\partial P(O,Y)/\partial\theta^{(2)}_s$. 
Then for any value of $\theta$, there exist real numbers $c_i$ ($1 \le i \le k_0$), $c^{(1)}_i$ ($1 \le i \le r$), and $c^{(2)}_i$ 
($1 \le i \le s-1$) such that 

$$\frac{\partial P(O,Y)}{\partial \theta^{(2)}_s} = \sum_{i=1}^{k_0} c_i \frac{\partial P(O,Y)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{r} c^{(1)}_i \frac{\partial P(O,Y)}{\partial \theta^{(1)}_i} + \sum_{i=1}^{s-1} c^{(2)}_i \frac{\partial P(O,Y)}{\partial \theta^{(2)}_i}.$$

Note that in the last term on the right hand side, $i$ runs from 1 to $s-1$. 

The parameter vector $\theta$ consists of three subvectors $\theta^{(0)}$, $\theta^{(1)}$ and $\theta^{(2)}$. Set the parameters 
$\theta^{(1)}$ (for the X-branch) as in Lemma 3. Then there exists an injective mapping $\rho$ from $\Omega_X$ 
to $\Omega_O$ such that 

$$P(O=\rho(x) \mid X=x) = 1 \quad \text{for all } x \in \Omega_X. \tag{8}$$

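For each of the vectors in (6), consider the subvector consisting only of elements for 
those states of O that are the images of states of X under the mapping $\rho$. Such subvectors 
will be denoted by $\partial P(O_X,Y)/\partial\theta^{(0)}_i$, $\partial P(O_X,Y)/\partial\theta^{(1)}_i$, and $\partial P(O_X,Y)/\partial\theta^{(2)}_i$. For any 
values of $\theta^{(0)}$ and $\theta^{(2)}$, we still have 

$$\frac{\partial P(O_X,Y)}{\partial \theta^{(2)}_s} = \sum_{i=1}^{k_0} c_i \frac{\partial P(O_X,Y)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{r} c^{(1)}_i \frac{\partial P(O_X,Y)}{\partial \theta^{(1)}_i} + \sum_{i=1}^{s-1} c^{(2)}_i \frac{\partial P(O_X,Y)}{\partial \theta^{(2)}_i}. \tag{9}$$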

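Consider the first two terms on the right hand side: 

$$\sum_{i=1}^{k_0} c_i \frac{\partial P(O_X,Y)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{r} c^{(1)}_i \frac{\partial P(O_X,Y)}{\partial \theta^{(1)}_i} = \sum_{i=1}^{k_0} c_i \sum_{Z} P(Y \mid Z) \frac{\partial P(O_X,Z)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{r} c^{(1)}_i \sum_{Z} P(Y \mid Z) \frac{\partial P(O_X,Z)}{\partial \theta^{(1)}_i}.$$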

Because of (8) and the fact that $P(O,Z) = \sum_X P(X,Z) P(O \mid X)$, the column vector 
$\partial P(O_X,Z)/\partial\theta^{(0)}_i$ is identical to the vector $\partial P(X,Z)/\partial\theta^{(0)}_i$. As we have argued when proving 
Claim 1, the vectors $\{\partial P(X,Z)/\partial\theta^{(0)}_i \mid i=1, \ldots, k_0\}$ constitute a basis for the $k_0$-dimensional 
Euclidean space. This implies that each of the vectors $\partial P(O_X,Z)/\partial\theta^{(1)}_i$ can be represented 
as a linear combination of the vectors $\{\partial P(O_X,Z)/\partial\theta^{(0)}_i \mid i = 1, \ldots, k_0\}$. Consequently, there 
exist $c'_i$ ($1 \le i \le k_0$) such that 

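$$\sum_{i=1}^{k_0} c_i \frac{\partial P(O_X,Z)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{r} c^{(1)}_i \frac{\partial P(O_X,Z)}{\partial \theta^{(1)}_i} = \sum_{i=1}^{k_0} c'_i \frac{\partial P(O_X,Z)}{\partial \theta^{(0)}_i}.$$

Hence 

$$\sum_{i=1}^{k_0} c_i \frac{\partial P(O_X,Y)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{r} c^{(1)}_i \frac{\partial P(O_X,Y)}{\partial \theta^{(1)}_i} = \sum_{i=1}^{k_0} c'_i \frac{\partial P(O_X,Y)}{\partial \theta^{(0)}_i}.$$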



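Combining this equation with equation (9), we get 

$$\frac{\partial P(O_X,Y)}{\partial \theta^{(2)}_s} = \sum_{i=1}^{k_0} c'_i \frac{\partial P(O_X,Y)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{s-1} c^{(2)}_i \frac{\partial P(O_X,Y)}{\partial \theta^{(2)}_i}.$$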

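Because of (8) and the fact that $P(O,Y) = \sum_X P(X,Y) P(O \mid X)$, the column 
vector $\partial P(O_X,Y)/\partial\theta^{(0)}_i$ is identical to the vector $\partial P(X,Y)/\partial\theta^{(0)}_i$ and the column vector 
$\partial P(O_X,Y)/\partial\theta^{(2)}_i$ is identical to the vector $\partial P(X,Y)/\partial\theta^{(2)}_i$. Hence 

$$\frac{\partial P(X,Y)}{\partial \theta^{(2)}_s} = \sum_{i=1}^{k_0} c'_i \frac{\partial P(X,Y)}{\partial \theta^{(0)}_i} + \sum_{i=1}^{s-1} c^{(2)}_i \frac{\partial P(X,Y)}{\partial \theta^{(2)}_i}.$$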



This contradicts the fact that the vectors in the equation form a basis for the column space 
of $J_{M_2}$ almost everywhere (see (5) in Section 5.2). Therefore, Claim 3 must be true. □ 
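
The identification of the restricted vectors with the vectors for X can be checked on a toy example. The Python sketch below assumes a deterministic, injective table for P(O|X) as in (8); the function name `joint_o_z`, the mapping, and the numbers are hypothetical and chosen only for illustration.

```python
def joint_o_z(p_xz, p_o_given_x):
    """P(O, Z) = sum_X P(X, Z) P(O | X); rows are indexed by o, columns by z."""
    n_x, n_z = len(p_xz), len(p_xz[0])
    n_o = len(p_o_given_x[0])
    return [
        [sum(p_xz[x][z] * p_o_given_x[x][o] for x in range(n_x)) for z in range(n_z)]
        for o in range(n_o)
    ]


# Injective mapping from states of X to states of O (|X| = 2, |O| = 3).
rho = {0: 2, 1: 0}
p_o_given_x = [[1.0 if o == rho[x] else 0.0 for o in range(3)] for x in range(2)]
p_xz = [[0.1, 0.2], [0.3, 0.4]]  # an arbitrary joint distribution P(X, Z)

p_oz = joint_o_z(p_xz, p_o_given_x)
# Keep only the rows for states of O that are images of states of X.
restricted = [p_oz[rho[x]] for x in range(2)]
```

Here `restricted` coincides with `p_xz` entry by entry, whatever values `p_xz` takes, which is why the derivatives of the restricted vector with respect to the parameters of P(X, Z) coincide with those of P(X, Z) itself.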

6. Effective Dimensions of Trees 

Let us use the term tree model to refer to Markov random fields on undirected trees over 
a finite number of random variables. If we root a tree model at any of its nodes, we get a 
tree-structured Bayesian network model. In a tree model, define leaf nodes to be those that 
have only one neighbor. An HLC model is a tree model where all leaf nodes are observed 
while all others are latent. 

It turns out that Theorem 1 enables us to compute the effective dimension of any tree 
model. Consider an arbitrary tree model. If some of its leaf nodes are latent, we can remove 
such nodes without affecting its effective dimension. 
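
A minimal Python sketch of this pruning step follows. It assumes the tree is given as an adjacency mapping and reads the text as allowing repeated removal, since deleting a latent leaf can turn its latent neighbor into a new leaf. The function name `prune_latent_leaves` and the example tree are hypothetical.

```python
def prune_latent_leaves(adjacency, observed):
    """Repeatedly delete latent nodes that have at most one neighbor."""
    adj = {node: set(nbrs) for node, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            # A latent node with one neighbor is a latent leaf; the "<= 1" also
            # covers a latent node left isolated by earlier removals.
            if node not in observed and len(adj[node]) <= 1:
                for nbr in adj[node]:
                    adj[nbr].discard(node)
                del adj[node]
                changed = True
    return adj


# Example: A and B are observed; H1, H2, H3 are latent, with H3 a latent leaf.
tree = {
    "H1": ["A", "B", "H2"],
    "A": ["H1"],
    "B": ["H1"],
    "H2": ["H1", "H3"],
    "H3": ["H2"],
}
pruned = prune_latent_leaves(tree, observed={"A", "B"})
```

In the example, H3 is removed first, which makes H2 a latent leaf that is removed in the next pass.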

After removing latent leaf nodes, all the leaf nodes are observed. If some non-leaf nodes 
are also observed, we can decompose the model into submodels at any observed non-leaf 
node. The following theorem tells us how the model and the submodels are related in terms 
of effective dimensions. 

Theorem 2 Suppose Y is an observed non-leaf node in a tree model M. If M decomposes 
at Y into $k$ submodels $M_1, \ldots, M_k$, then 

$$d_e(M) = \sum_{i=1}^{k} d_e(M_i) - (k-1)(|Y| - 1).$$



After all possible decompositions, the final submodels either do not contain latent nodes 
or are HLC models. Effective dimensions of submodels with no latent variables are simply 
their standard dimensions. If an HLC submodel is irregular, we make it regular by applying 
the transformation mentioned at the end of Section 3.2. The transformation does not affect 
the effective dimensions of the submodels. Finally, effective dimensions of regular HLC 
submodels can be computed using Theorem 1. 
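
As a worked illustration of the formula in Theorem 2, the Python sketch below applies it to a fully observed chain A - Y - B, where each submodel is a two-node model whose effective dimension equals its standard dimension. The function names and the cardinalities are hypothetical.

```python
def decomposed_effective_dimension(submodel_dims, card_y):
    """Theorem 2: d_e(M) = sum_i d_e(M_i) - (k - 1) * (|Y| - 1)."""
    k = len(submodel_dims)
    return sum(submodel_dims) - (k - 1) * (card_y - 1)


def pair_standard_dimension(card_a, card_b):
    """Free parameters of a fully observed two-node model: |A||B| - 1."""
    return card_a * card_b - 1


# Chain A - Y - B with |A| = 2, |Y| = 3, |B| = 4, decomposed at Y.
card_a, card_y, card_b = 2, 3, 4
dims = [pair_standard_dimension(card_a, card_y), pair_standard_dimension(card_y, card_b)]
d_e = decomposed_effective_dimension(dims, card_y)

# Direct count for the network rooted at Y: P(Y), P(A|Y), P(B|Y).
direct = (card_y - 1) + card_y * (card_a - 1) + card_y * (card_b - 1)
```

Both routes give 14 for these cardinalities: 5 + 11 - 2 from the decomposition and 2 + 3 + 9 from the direct parameter count.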

Proof of Theorem 2: It is possible to prove this theorem starting from the Jacobian 
matrix. Here we take a less formal but more revealing approach. 

It suffices to consider the case $k = 2$; the general case follows by induction on $k$. The two 
submodels $M_1$ and $M_2$ share only one node, namely Y. Let $O_1$ and $O_2$ be respectively the 
sets of observed nodes in those two submodels excluding Y. Root M at Y. Then we have 

$$P(Y, O_1, O_2)\, P(Y) = P(O_1, Y)\, P(O_2, Y).$$

Let $\theta_0$ be the set of parameters in the distribution $P(Y)$, and let $\theta_1$ and $\theta_2$ be respectively the sets 
of parameters in the conditional probability distributions of nodes in $M_1$ and $M_2$. Consider 
fixing $\theta_0$ and letting $\theta_1$ and $\theta_2$ vary. In this case, the space spanned by $P(Y)$ consists of only 
one vector, namely $\theta_0$ itself. Moreover, there is a one-to-one correspondence between vectors 
in the space spanned by $P(Y, O_1, O_2)$ and vectors in the Cartesian product of the spaces 
spanned by $P(O_1, Y)$ and $P(O_2, Y)$. Now let $\theta_0$ vary. This adds $|Y| - 1$ dimensions to each 
of the four spaces spanned by $P(Y, O_1, O_2)$, $P(Y)$, $P(O_1, Y)$, and $P(O_2, Y)$. Consequently, 
we have 

$$d_e(M) = d_e(M_1) + d_e(M_2) - (|Y| - 1).$$

The theorem is proved. □ 
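
The factorization used at the start of the proof can be checked numerically. The Python sketch below assumes the simplest possible setting, in which $O_1$ and $O_2$ are single binary nodes that are both children of a binary Y; all table values are hypothetical, and exact fractions are used so that the equality is not blurred by rounding.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical tables for a binary Y with two binary children O1 and O2.
p_y = [F(1, 4), F(3, 4)]
p_o1_given_y = [[F(1, 3), F(2, 3)], [F(1, 2), F(1, 2)]]  # indexed [y][o1]
p_o2_given_y = [[F(1, 5), F(4, 5)], [F(2, 7), F(5, 7)]]  # indexed [y][o2]

# Joint table P(Y, O1, O2) of the model rooted at Y.
joint = {
    (y, o1, o2): p_y[y] * p_o1_given_y[y][o1] * p_o2_given_y[y][o2]
    for y, o1, o2 in product(range(2), repeat=3)
}


def marginal(keep):
    """Sum the joint table down to the coordinates listed in `keep`."""
    table = {}
    for state, prob in joint.items():
        key = tuple(state[i] for i in keep)
        table[key] = table.get(key, F(0)) + prob
    return table


m_y, m_o1y, m_o2y = marginal([0]), marginal([0, 1]), marginal([0, 2])

# Check P(Y, O1, O2) P(Y) = P(O1, Y) P(O2, Y) for every joint state.
identity_holds = all(
    joint[(y, o1, o2)] * m_y[(y,)] == m_o1y[(y, o1)] * m_o2y[(y, o2)]
    for y, o1, o2 in joint
)
```

The check relies only on $O_1$ and $O_2$ being conditionally independent given Y, which is what decomposing at an observed node provides.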

7. Concluding Remarks 

In this paper we study the effective dimensions of HLC models. The work is motivated by 
empirical evidence that the BIC score behaves quite well when used with several hill-climbing 
algorithms for learning HLC models and that the BICe score sometimes leads to better 
model selection than the BIC score. We have proved a theorem that relates the effective 
dimension of an HLC model to the effective dimensions of two other HLC models that 
contain fewer latent variables. Repeated application of the theorem allows one to reduce 
the task of computing the effective dimension of an HLC model to subtasks of computing 
effective dimensions of LC models. This makes it computationally feasible to compute the 
effective dimensions of large HLC models. In addition, we have proved a theorem about 
effective dimensions of general tree models. This and our main theorem allow one to 
compute the effective dimension of arbitrary tree models. 